Abstract


n-d Array Summation Capability


“Embedded software processing requirements for DSP, especially for radar, are
expected to exceed 1 x 10^12 operations per second within five years [3].”
Therefore, the efficient use of memory at all levels of the hierarchy is
essential. These array-based computations involve the composition of linear
and multi-linear operators. Previous work illustrated how a general array
algebra (MoA) and a “suitably rich compatible index calculus [3]”
(Psi-Calculus) could be used to develop software for radar and other DSP
applications. This software needs to be tuned to use the levels of memory
hierarchies efficiently without the materialization of array-valued
temporaries [3]. Monolithic compiler experiments presented in [4] illustrated
how these theories could be mechanized using expression templates in C++. The
present work continues these investigations by defining an N-dimensional array
class with shape in order to support the mechanization of linear
transformations in the Psi-Calculus (ψ-Calculus). We show that this class
extends the support for array operations in the Portable Expression Template
Engine (PETE) while offering performance that is competitive with hand-coded
C. Such extensions are needed to support the dimension lifting which maps
arrays to all levels of a memory/processor hierarchy.

Keywords: embedded digital systems, radar, signal processing, arrays,
high performance, index calculus, shapes, psi, MoA.

Figure 1: Array class with shape for N-dimensional support.

Introduction

Motivating this paper is the development of efficient algorithms for radar
and, more generally, DSP applications. “Reasoning about radar, from a
computational perspective, entails reasoning about the data structures
underlying the algorithms for radar computations [3].” Arrays are the data
structures underlying algorithms for radar computations. These algorithms are
characterized by linear and multi-linear matrix operations. Therefore, a
high-level array algebra can facilitate efficient, scalable and portable
algorithm design. “Consequently, we believe that the future development of
efficient, scalable, portable algorithms, for radar, more generally for DSP
applications, will be greatly facilitated by the use of a high-level array
algebra during algorithm design. Additionally, since program efficiency
depends critically upon the efficient use of memory/processor hierarchies,
this array algebra should be combined with a suitably powerful index calculus.
This calculus should facilitate data layout, movement, and manipulation at all
levels of the memory/processor hierarchy [3].” It is shown in [5, 1] that MoA
and the ψ-Calculus are suitable for such an algebra and calculus. For example,
[1] presents in detail how MoA and the ψ-Calculus can be integrated into a
Time Domain (TD) convolution algorithm. The algorithm development presented in
[1] entailed array dimension lifting and data restructuring. These were driven
by the memory/processor hierarchy, coincident with array decompositions and
layouts. This process was shown to minimize temporary array materializations
using these theories. Prior to this, compiler experiments were presented in
[4] which demonstrated the mechanization of these theories via expression
templates in C++.

Synthetic aperture radar (SAR) has many applications (e.g., target detection,
continuous observation of dynamic phenomena and classification of vegetation
[6]). This is due to its high-resolution imaging capabilities under varying
conditions (day and night, all-weather). Various SAR signal processing methods
have been developed (e.g., spectral analysis (FFT) and frequency-domain
convolution). However, time-domain (TD) analysis is the simplest and most
accurate algorithm for SAR signal processing [6]. The TD algorithm is also the
most computationally intensive, which makes it useful only with SAR data of
limited size [6]. Consequently, faster TD algorithms are needed as the size
and resolution requirements increase.

TD Convolution: MoA Design

A C++ vector class for defining the TD convolution was presented in [4]. The
related experiments showed that the creation of array-valued temporaries could
be avoided, enabling performance competitive with hand-coded C.

A uni-processor implementation using vector arguments with the vector class
presented in [4] was sufficient for those experiments. However, to support
mapping to processor/memory hierarchies, vector arguments must be
algebraically abstracted to higher dimensional arrays. When processors are
added to the design, the dimension of the problem is lifted. Adding a cache
loop adds yet another dimension. Thus, we started with a 1-dimensional
problem, then abstracted the computation to a second (time) dimension. Adding
processor and cache mapping ultimately resulted in a 4-dimensional problem. In
addition, if we desire to support 3-dimensional or higher array arguments,
dimension lifting may require 10 or more dimensions. Such high dimensionality
is not typically supported in today’s languages or libraries. Figure 1
illustrates our ability to do so.
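As an illustration of the lifting described above, the sketch below re-indexes
a flat vector under an assumed row-major layout. It lifts a length-n vector to
the shape <procs, n / (procs * block), block>, so that one axis names the
processor and another names the cache block. The names LiftedIndex, lift and
flatten are hypothetical and are not taken from [1] or [4]; the sketch covers
only the processor and cache axes, not the time dimension.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type: the lifted (3-dimensional) index of one element.
struct LiftedIndex {
    std::size_t proc;  // which processor owns the element
    std::size_t blk;   // which cache block on that processor
    std::size_t elem;  // position inside the cache block
};

// Map flat index i of a length-n vector to its index in the lifted shape
// <procs, n / (procs * block), block>. Assumes row-major layout and that
// procs * block divides n evenly.
inline LiftedIndex lift(std::size_t i, std::size_t n,
                        std::size_t procs, std::size_t block) {
    assert(procs > 0 && block > 0 && n % (procs * block) == 0 && i < n);
    const std::size_t per_proc = n / procs;  // elements per processor
    return LiftedIndex{i / per_proc, (i % per_proc) / block, i % block};
}

// Inverse mapping: recover the flat index from the lifted index.
inline std::size_t flatten(const LiftedIndex& ix, std::size_t n,
                           std::size_t procs, std::size_t block) {
    assert(procs > 0 && block > 0 && n % (procs * block) == 0);
    const std::size_t per_proc = n / procs;
    return ix.proc * per_proc + ix.blk * block + ix.elem;
}
```

With n = 16, procs = 2 and block = 4, element 13 lands on processor 1, cache
block 1, position 1. No data moves: lifting only changes how the same storage
is indexed, which is why each additional level of the hierarchy costs one more
dimension rather than one more temporary.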

Shape: a new class

PETE facilitates the creation of optimized C++ code to do various mathematical
operations. However, PETE operates on scalars and does not provide an
interface for the multi-dimensional array computations that are required for
the ψ-Calculus. In addition to the dimension limitation, PETE does not support
the shape notion, which is a component used to calculate the attribute
evaluation rules needed to rewrite an Abstract Syntax Tree (AST) defining
array expressions. Fortunately, PETE is designed to be extensible. To that
end, we have implemented a multi-dimensional array object extension to support
shape. The shape vector uses the Standard Template Library (STL) vector<int>
class, which conveniently enables the needed N-dimensionality. The shape
vector is passed to a specialized Array class which constructs the
multi-dimensional array and enables assignments and arithmetic using operator
overloading and expression templates respectively.

Array.h

template <class T = int>
class Array
{
    ...
    template<class RHS>
    Array &operator=(const Expression<RHS> &rhs)
    {
        // equivalent to: a.d[i] = b.d[i]+c.d[i]+d.d[i]
        for(long i=0; i<this->size; i++)
            d[i] = forEach(rhs, EvalLeaf1(i), OpCombine());
        return *this;
    }
    ...
private:
    T * d;
    vector<int> shape;
    long size;
};
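To make the role of the shape vector concrete, the following is a minimal
sketch of a shape-driven container. It is an illustrative stand-in, not the
Array class of Figure 1: the name ShapedArray, its member functions, and the
row-major layout are all assumptions made for this example. It shows how a
vector<int> of extents determines both the total storage size and the flat
offset of an N-dimensional index, for any number of dimensions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical N-dimensional container whose rank is set at run time
// by the length of the shape vector (row-major layout assumed).
template <class T = int>
class ShapedArray {
public:
    explicit ShapedArray(const std::vector<int>& shape)
        : shape_(shape), data_(total(shape)) {}

    std::size_t size() const { return data_.size(); }   // element count
    std::size_t rank() const { return shape_.size(); }  // number of dimensions

    // Row-major offset of a full index (one entry per dimension).
    std::size_t offset(const std::vector<int>& index) const {
        assert(index.size() == shape_.size());
        std::size_t off = 0;
        for (std::size_t k = 0; k < shape_.size(); ++k) {
            assert(index[k] >= 0 && index[k] < shape_[k]);
            off = off * static_cast<std::size_t>(shape_[k])
                + static_cast<std::size_t>(index[k]);
        }
        return off;
    }

    T& at(const std::vector<int>& index) { return data_[offset(index)]; }
    const T& at(const std::vector<int>& index) const {
        return data_[offset(index)];
    }

private:
    // Product of all extents; an empty shape denotes a scalar (one element).
    static std::size_t total(const std::vector<int>& shape) {
        std::size_t n = 1;
        for (int extent : shape) {
            assert(extent > 0);
            n *= static_cast<std::size_t>(extent);
        }
        return n;
    }

    std::vector<int> shape_;
    std::vector<T> data_;
};
```

For a shape of <2, 3, 4> the container holds 24 elements and the index
<1, 2, 3> maps to flat offset 23. Because the rank is simply the length of the
shape vector, a 10-dimensional array needs no additional code.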

Our experiments test the efficiency of our implementation and show that it is
much faster than a standard C++ implementation and is similar in speed to an
implementation in hand-coded C. They also show that the N-dimensional
functionality of our Array class does not increase the overhead compared to a
typical PETE implementation.

Implementation

In a previous paper [4], experimental results were presented for computing
with 1-dimensional arrays. To demonstrate our efforts to extend these ideas to
multi-dimensional arrays, we present experimental results for our
multi-dimensional Array object implementation. The class is templated and
therefore supports any data type. This is necessary so that we can use common
scientific computing data types such as float and double. The experiments
therefore test the integer and float data types for basic math operations
which are fundamental to the ψ-Calculus (i.e. distributing indexing over
scalar operations).

The results presented here are for the addition¹ of three multi-dimensional
arrays and the assignment of their result to a fourth multi-dimensional array.

The Array is defined through a shape vector. This vector stores the size of
each dimension of an Array and is passed to the Array class constructor.
Default and copy constructors are also defined. This class incorporates
operator overloading and PETE-related expression tree definitions. Our class
gets its efficiency by defining and implementing the high-level operation of
the expression template. This allows our class to interface with PETE,
utilizing its standard mathematical operations and expression tree operation
evaluation ordering. The shape vector mentioned above is a relevant feature of
the ψ-Calculus.
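The expression-template idea that the class relies on can be sketched without
PETE. The code below is a hand-written, simplified illustration and does not
use or reproduce PETE's API: the names Expr, Sum and Vec are hypothetical, the
element type is fixed to double for brevity, and only addition is supported.
It shows how a = b + c + d can be evaluated in a single loop with no
array-valued temporaries.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CRTP base that tags every node of the expression tree.
template <class E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Interior node: the element-wise sum of two sub-expressions.
// It stores references only, so no array is allocated when it is built.
template <class L, class R>
class Sum : public Expr<Sum<L, R>> {
public:
    Sum(const L& l, const R& r) : l_(l), r_(r) {}
    double operator[](std::size_t i) const { return l_[i] + r_[i]; }
    std::size_t size() const { return l_.size(); }
private:
    const L& l_;
    const R& r_;
};

// Leaf node: owns the storage.
class Vec : public Expr<Vec> {
public:
    explicit Vec(std::size_t n, double v = 0.0) : d_(n, v) {}
    double operator[](std::size_t i) const { return d_[i]; }
    double& operator[](std::size_t i) { return d_[i]; }
    std::size_t size() const { return d_.size(); }

    // Assignment walks the expression tree once per element.
    template <class E>
    Vec& operator=(const Expr<E>& rhs) {
        const E& e = rhs.self();
        assert(e.size() == d_.size());
        for (std::size_t i = 0; i < d_.size(); ++i) d_[i] = e[i];
        return *this;
    }
private:
    std::vector<double> d_;
};

// operator+ builds a tree node instead of computing a result array.
template <class L, class R>
Sum<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return Sum<L, R>(l.self(), r.self());
}
```

Writing a = b + c + d with four Vec objects of equal length builds a
Sum<Sum<Vec, Vec>, Vec> node and evaluates it element by element inside
operator=. The nodes hold references to their operands, so an expression
should be consumed in the statement that creates it rather than stored for
later use.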

Essentially, the ψ-Calculus rules can be implemented at the iterator level.
Complex non-algebraic array expressions can be reduced to memory access
patterns, and the multiple levels of indirection can be handled by the
iterator abstraction (the pointer on steroids).
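One way to read "reduced to memory access patterns" is sketched below for the
simplest case: indexing with a partial index. Under an assumed row-major,
contiguous layout, a partial index of length k into an array of rank r selects
a sub-array of rank r - k that occupies one contiguous run of the flat buffer.
The function name psi and the Slice type are hypothetical and simplified; this
is an illustration of the idea, not the authors' implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of the contiguous run selected by a partial index.
struct Slice {
    std::size_t offset;      // flat position of the first selected element
    std::vector<int> shape;  // shape of the selected sub-array
    std::size_t count;       // number of elements in the sub-array
};

// Resolve a partial index against a shape (row-major layout assumed).
// An empty index selects the whole array; a full index selects one element.
inline Slice psi(const std::vector<int>& index, const std::vector<int>& shape) {
    assert(index.size() <= shape.size());
    // Row-major offset of the index within the leading dimensions.
    std::size_t lead = 0;
    for (std::size_t k = 0; k < index.size(); ++k) {
        assert(index[k] >= 0 && index[k] < shape[k]);
        lead = lead * static_cast<std::size_t>(shape[k])
             + static_cast<std::size_t>(index[k]);
    }
    // The remaining dimensions form the shape of the result.
    std::vector<int> rest(
        shape.begin() + static_cast<std::ptrdiff_t>(index.size()),
        shape.end());
    std::size_t count = 1;
    for (int extent : rest) count *= static_cast<std::size_t>(extent);
    return Slice{lead * count, rest, count};
}
```

For a shape of <2, 3, 4>, the partial index <1, 2> resolves to offset 20 with
result shape <4>. An iterator over the result is then just the pointer range
from data + offset to data + offset + count, so the indexing is resolved once
and the inner loop touches memory directly.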

Experiments 

Our tests were compiled on two COTS platforms: an 800 MHz Pentium III
processor with 320MB of memory running Redhat Linux 7.2 and a 200 MHz IBM
PowerPC with 4GB of memory running AIX Version 5. The test code was compiled
using Intel C++ and GCC respectively.

Figures 2 and 3 show our results, which measure the differences between six
implementations of multi-dimensional array addition²: C++ (static, int),
Array class (dynamic, float), Array class (dynamic, int), PETE (dynamic, int),
C (dynamic, int) and C (static, int).

The results were similar on both platforms. Specifically, the C (static, int)
version performed the fastest and the C++ (static, int) version performed the
slowest. It is clear that the Object Oriented Programming (OOP) constructs of
C++ affect performance. However, it is well known that OOP constructs improve
programmability (ease of use, extendibility, reusability and quality [2]) over
C, in particular when extending the implementation to complex applications.
With PETE and our C++ Array class, we achieved performance similar to that
obtained using C where the loops are optimized by hand. This is shown by the
results using the integer type, which performed the same as the pure PETE
coded results. Even the float data type version was still significantly faster
than the traditional C++ (static, int) implementation. These results further
validate that we can integrate the high performance of optimized C loops into
the OO/C++ paradigm.

Overall, these experiments show that extending PETE to N-dimensional Array
operations via the ψ-Calculus shape notion is viable. The speed of the
computations is impressive even when using templated types including the float
data type.

¹ All PETE scalar operations are included.

² This could be any binary scalar operation: divide, ceiling, etc.


Figure 2: Five implementations of multi-dimensional addition.

Conclusion and Future Work

The results shown demonstrate the viability of using PETE as a means to
optimize operations that are essential to a fully functional ψ-Calculus
library. They represent a very important step: inclusion of the shape notion
into the Array class.

These results are encouraging. Future work may include reducing the
performance penalty caused by templating the Array type and adding further
algorithm methods to enable ψ-Calculus operations.